An interpretable machine learning model for predicting 28-day mortality in patients with sepsis-associated liver injury

Sepsis-Associated Liver Injury (SALI) is an independent risk factor for death from sepsis. The aim of this study was to develop an interpretable machine learning model for early prediction of 28-day mortality in patients with SALI. Data from the Medical Information Mart for Intensive Care (MIMIC-IV, v2.2, MIMIC-III, v1.4) were used in this study. The study cohort from MIMIC-IV was randomized to the training set (0.7) and the internal validation set (0.3), with MIMIC-III (2001 to 2008) as external validation. The features with more than 20% missing values were deleted and the remaining features were multiple interpolated. Lasso-CV that lasso linear model with iterative fitting along a regularization path in which the best model is selected by cross-validation was used to select important features for model development. Eight machine learning models including Random Forest (RF), Logistic Regression, Decision Tree, Extreme Gradient Boost (XGBoost), K Nearest Neighbor, Support Vector Machine, Generalized Linear Models in which the best model is selected by cross-validation (CV_glmnet), and Linear Discriminant Analysis (LDA) were developed. Shapley additive interpretation (SHAP) was used to improve the interpretability of the optimal model. At last, a total of 1043 patients were included, of whom 710 were from MIMIC-IV and 333 from MIMIC-III. Twenty-four clinically relevant parameters were selected for model construction. For the prediction of 28-day mortality of SALI in the internal validation set, the area under the curve (AUC (95% CI)) of RF was 0.79 (95% CI: 0.73–0.86), and which performed the best. Compared with the traditional disease severity scores including Oxford Acute Severity of Illness Score (OASIS), Sequential Organ Failure Assessment (SOFA), Simplified Acute Physiology Score II (SAPS II), Logistic Organ Dysfunction Score (LODS), Systemic Inflammatory Response Syndrome (SIRS), and Acute Physiology Score III (APS III), RF also had the best performance. SHAP analysis found that Urine output, Charlson Comorbidity Index (CCI), minimal Glasgow Coma Scale (GCS_min), blood urea nitrogen (BUN) and admission_age were the five most important features affecting RF model. Therefore, RF has good predictive ability for 28-day mortality prediction in SALI. Urine output, CCI, GCS_min, BUN and age at admission(admission_age) within 24 h after intensive care unit(ICU) admission contribute significantly to model prediction.


Introduction
Sepsis is a syndrome of multiple organ dysfunction caused by an abnormal immune response to infection [1], and being one of the common diseases in intensive care units (ICU), it has been an important global health problem.The Global Burden of Disease Study, published in 2020, analyzed global, regional and national sepsis incidence and mortality rates from 1990 to 2017 and reported that there were approximately 48.9 million cases of sepsis in 2017, with about 11 million sepsis-related deaths, accounting for 19.7% of all deaths worldwide [2].High risk of rehospitalization and high cost of treatment for sepsis [3,4].In the United States, sepsis was the most expensive condition treated, amounting to $38.2 billion or 8.8% of aggregate costs for all hospital stays in 2017 [5].
The liver is a vital organ for the human body which regulates the balance of metabolism and immunity [6,7].The liver is essential for regulating immune defense during sepsis and the mechanisms it is involved with are lipopolysaccharide detoxification, bacterial clearance, acute-phase protein and cytokine release, inflammation metabolic regulation, etc. [8] The production of large amounts of endotoxins and the release of inflammatory factors in sepsis lead to abnormal immune responses that impair the function of multiple organs, including the liver [9].When there is an inappropriate immune response or excessive inflammation in the liver, the ability to clear pathogens is impaired and liver metabolism is disrupted.Sepsis associated liver injury (SALI) can be caused by a variety of factors, including pathogens or shock, an exaggerated inflammatory response, persistent microcirculation failure, or even oxidative stress [10].There are two main manifestations of SALI: ischemic hypoxic liver injury and sepsisrelated cholestasis.There are no unified diagnostic criteria for SALI, and the Surviving Sepsis Campaign (SSC) Guidelines recommended to use total bilirubin(TBIL) >2 mg/dL and international standardized ratio (INR) >1.5 as the diagnostic criteria [11].In the assessment of the severity of disease, Sequential Organ Failure Assessment (SOFA) [12], Oxford Acute Severity of Illness Score (OASIS) [13], Acute Physiology Score III (APS III) [14], Logistic Organ Dysfunction Score (LODS) [15], Simplified Acute Physiology Score II (SAPS II) [16], Systemic Inflammatory Response Syndrome (SIRS), and Glasgow Coma Scale (GCS) were some traditional scorings of disease severity.
Studies have shown that the incidence of SALI in the U.S. adult sepsis population is 34% and 46%, which is considered as an independent risk factor for death from sepsis, and that patients who develop SALI have an increased risk of death of nearly 54% [9,17].The high mortality rate of SALI may be related to the lack of effective diagnostic tools and early warning systems.The aim of this study is to develop an explicable machine learning model that can predict the 28-day mortality of SALI early, provide early warning for SALI, and remind clinicians to conduct effective clinical interventions in patients to reduce their 28-day mortality.

Data source
This is a retrospective cohort study based on data the extracted from two open databases at the same center, including critical care databases v2.

Data preprocessing
All data processing was done in the R or python environment.First, the cohorts from the two databases were divided into either the death group or the survival group (Patient outcome defined as 1 for death and 0 for survival), and the differences in each of the clinical features between the two group were compared.Second, we conducted a missing value analysis (S3 Table ) and removed features with missing values exceeding 20%.Third, we used multiple interpolation to interpolate features with less than 20% missing values.The data overlap before and after interpolation were good, and the distribution of the original and interpolated data is shown in S1 Fig.Then, based on the Lasso-CV method with an optimal regularization parameter of 0.113 for feature screening of the interpolated data after excluding SIRS, SOFA, OASIS, SAP-SII, LODS, and APSIII, which are comprehensive scores that can comprehensive assessment the severity of the disease, 24 features were ultimately selected to develop the model (S4 Table ).

Model development and validation
The data extracted from MIMIC-IV were randomly divided into training and internal validation sets according to 7:3, and the data from MIMIC-III that did not overlap with MIMIC-IV were used as the external validation set.We chose the following eight models in the training set for model training: Random Forest (RF), Logistic Regression, Decision Tree, Extreme Gradient Boost (XGBoost), K Nearest Neighbor Model (KNN), Support Vector Machine (SVM), Network for Generalized Linear Models in which the best model is selected by cross-validation (CV_glmnet), and Linear Discriminant Analysis (LDA).Internal and external validation sets were used to test the performance of the model.We used area under the curve (AUC), accuracy, precision, recall, and specificity to evaluate the performance of the models, and the most important of these indicators was AUC.The optimal model was compared with the traditional clinical disease severity scores (SIRS, SOFA, OASIS, SAPSII, LODS, and APSIII) to better predict the 28-day mortality risk of patients with SALI, in order to alert the clinicians to make early interventions.We hyper-parameterized the optimal model to obtain the optimal performance of the model.

Model explainability
The Shapley additive explanations (SHAP) method was used to improve the interpretability of the final model.SHAP is a machine learning interpretation method that can be used to interpret the importance of features in model prediction results [21].It is based on the concept of Shapley value in cooperative game theory and uses an additive method to calculate the contribution of each feature to the model prediction results.The SHAP algorithm can provide an explanatory value for each feature, indicating the degree of influence of the feature on the model's prediction results, and the results of the calculation can be used to explain not only the feature importance of individual predictions, but also the feature importance distribution of the entire dataset.

Statistical analysis
Values are expressed as medians (interquartile range) for continuous variables and totals (percentages) for categorical variables.The rank sum test was used for continuous variables and the Chi-square test for categorical variables.After data preprocessing and feature selection, we developed eight popular machine learning models to predict 28-day mortality in patients with sepsis-related liver injury.The overall performance of each model was evaluated on their AUC, accuracy, precision, recall, and specificity.The best performing model was interpreted using Shapley values.
All calculations and analyses were performed using R 4.2.1 and Python 3.7 software.All statistical tests were 2-sided, and P values<0.05were considered to be statistically significant.

Baseline characteristics
There was a total of 73,181 records in MIMIC-IV, and after screening the records based on the inclusion and exclusion criteria, 710 records were finally obtained.Of these, 497 cases were used as the training set and 213 cases were used as the internal validation set.MIMIC-III (2001-2008) included 28,391 records, with 333 patients ultimately included as an external validation set.The flow chart of this study is shown in Fig 1 .Table 1 shows the baseline

Model development and validation
Twenty-four features were screened for model construction (S4 Table ).The features coefficients were plotted in Fig 2 .A positive value of the coefficient of identity indicates a positive effect on 28-day mortality, while a negative value of the coefficient indicates a negative effect.
We selected the top three models in terms of AUC value for Decision Curve Analysis (DCA), and RF remained the best performing model among them (Fig 4)  3 compared performance evaluation of RF and traditional disease severity score in the internal validation set.RF was the optimal model for predicting 28-day mortality in patients with SALI.We also compared the predictive performance of the models in the external validation set and the results are shown in S6 and S7 Tables.Hyperparameter tuning resulted in better predictive performance of the model, and

Model explainability
To improve the clinical utility of the model, we used the SHAP method to determine which features contribute to the model's prediction of 28-day mortality in patients with sepsis, which is shown in Fig 6.Based on the summary plot of SHAP, we further derived the top 5 influential SHAP dependency plots to explain the effect of clinical characteristics on the risk of 28-day death (Fig 7).The vertical axis of the SHAP dependency plot is the SHAP value of the clinical characteristic, while the horizontal axis is the range of variation of the clinical characteristic, where a SHAP value higher than zero indicates that the patient has an increased risk of 28-day death.

Discussion
We developed and validated a predictive model using a large dataset in order to build a valid, stable, and interpretable model to predict 28-day mortality in patients with SALI.Among the multiple models developed, RF was the most reliable and stable and had the best predictive performance.We compared RF with the traditional disease severity scores (SIRS, SOFA, OASIS, SAPSII, LODS, and APSIII) and found that the RF model was still performed the best.
An external validation of the model was performed, confirming the stability of RF.To date, no researchers have developed a predictive model for 28-day mortality in patients with SALI, and no studies have used multi-model screening for optimal model development.Some researchers have used nomogram to predict in-hospital mortality and 90-day mortality in patients with SALI, but they have only compared the developed model with some of the traditional disease severity scores [22,23].We also performed hyperparameter tuning of the model after developing the optimal model to optimize the predictive performance of the model [24,25].In addition, we screened the five clinical features that contributed most to the model, which were Urine output, CCI, GCS, BUN, and admission_age.Therefore, clinical features can serve as early warning.Shapley value was used to explain the opacity of the model [26].Model opacity refers to the opacity of the intermediate process between the input of data and the output of results [27,28].From Fig 7A, Shapley value was 0 when the urine volume was about 1000 ml within 24 h after admission, and the Shapley value decreased, which showed a negative effect on 28-d mortality in SALI, when the urine volume increased.The GCS value for a Shapley value of 0 is approximately 10 from Fig 7C, and its effect on 28-d mortality in SALI is consistent with urine output.However, the effect of CCI, BUN and admission_age on 28-day mortality in SALI were opposite to the trend of the first two features.The Shapley values tended to approach 0 when the CCI, BUN and admission_age were about 6, 22 and 65, respectively.It is well known that patients with severe sepsis have severely impaired microcirculation and reduced end-organ tissue perfusion, exacerbating organ damage.Urine output is one of the traditional indicators of tissue perfusion that can be used to assess microcirculation [29].A study by Heffernan et al. on the relationship between urine output and mortality in critically ill patients showed that a urine output threshold of less than 0.5 mL/kg/hr moderately predicted mortality in ICU inpatients [30].This serves as a reminder to clinicians that they need to focus not only on the total amount of urine output, but also on changes to urine output over time to detect changes in the patient's microcirculatory concerns in a timely manner.Bun is one of the indicators used to assess kidney function.wen et al. showed that bun greater than or equal to 21 mg/dl is one of the most important predictors of mortality risk in patients with sepsis, which is almost consistent with our results [31].The CCI, developed in 1987, is considered the gold standard for assessing comorbidities in clinical studies [32], as a tool used to predict long-term mortality in patients [33].Previous studies have also shown an increase in patient mortality with increasing CCI [34,35], consistent with our results.We usually use the GCS which is a scale used to assess a patient's level of consciousness [36].Lai Q et al. incorporated GCS into a model construct to assess in-hospital mortality in patients with sepsis.Our model and that of Lai Q et al. consistently show that GCS is an important clinical indicator in predicting the risk of death in patients with sepsis [37].As for the admission_age, as people aging, their bodily functions gradually deteriorate, and the functioning of their organs diminishes.This may explain why admissio-n_age was one of the top five important predictors of 28-day mortality in patients with sepsis-related liver injury.
We found that the top 5 metrics that had the greatest impact on predicting performance were not liver function-related metrics.The ALBI grade is a new score for assessing liver function, which was developed by Dr. Philip J. Johnson, Professor of Translational Oncology at the University of Liverpool, UK [38].However, due to too many missing values, more than 40% (S4 Table ), it was excluded when incorporating the clinical features used to construct the model, and in Table 1, no statistically significant difference between the two groups of ABLI in the SALI death group and survival group.Moreover, liver function related measurements, except ALP death group was significantly higher than the stock group, other measurements of the two groups of patients were not significantly different, in Table 1.SALI is a hepatic impairment caused by sepsis, usually accompanied by other organ injuries, only liver function impairment-related indexes are not sufficient to represent the overall severity of this group of patients, and there is no significant difference between liver function-related indexes in the death group and the survival group, they may be the reason for the absence of indicators of liver injury among the five most important indicators affecting the predictive performance of the model.
There are some limitations in our study.First, our modeling used a single-center dataset and was a retrospective study; In addition, non-overlapping dataset with MIMIC-IV in MIMIC-III was used as an external validation queue, and the chronology was not forwardlooking; Third, we focused only on the clinical indicators within 24 h after ICU admission and did not assess the impact of changes in the clinical features on the outcomes during the ICU stay.Therefore, further design of multicenter prospective studies is needed to validate our findings.

Fig 1 .
Fig 1.The flow chart of this study.A. Screening Process for MIMIC-IV.B. Screening Process for MIMIC-III.https://doi.org/10.1371/journal.pone.0303469.g001 . The RF model showed better predictive performance when compared to the traditional disease severity scores (SIRS (AUC.0.53 (95% CI: 0.45-0.60)),SOFA (AUC.0.62 (95% CI: 0.55-0.70)),OASIS (AUC.0.62 (95% CI: 0.55-0.70)),SAPSII (AUC.0.61 (95% CI: 0.53-0.69)),LODS (AUC.0.65 (95% CI: 0.58-0.73)),and APSIII (AUC.0.61 (95% CI: 0.54-0.69)).The ROCs are shown in Fig 5, and Table Fig 6A shows the distribution of the SHAP values of the top 20 clinical features: each point in the figure represents a feature, and the position of the point indicates the SHAP value of the feature, with the value representing the magnitude of the feature's contribution to the model output.If the value is positive, the feature positively influences the output; if the value is negative, the feature negatively influences the output.Red color indicates high values and blue color indicates low values.A darker color indicates a stronger influence of the feature on the target feature.Fig 6A shows a low SHAP value for urine output and GCS_min that indicated a positive influence on 28-day mortality, while the CCI, BUN and admission_age displayed an opposite trend.The bar chart was formed by ranking the features from high to low according to their average SHAP absolute values, indicating the degree of the contribution of each feature to the whole model.The larger the SHAP absolute value is, the more important the feature is, and the greater impact it has on the model output results.From Fig 6B, it is easy to see that the top five clinically important features were urin output, CCI, GCS_min, BUN and admission_age.

Fig 6 .Fig 7 .
Fig 6.SHAP summary chart.A. SHAP values showing the influence of different features on the output of RF Model.B. Mean absolute SHAP values for each clinical feature.https://doi.org/10.1371/journal.pone.0303469.g006